First, Load the previous assignment.
Note: This loading approach was suggested by petrelharp (petrelharp, 2016).
The dataset I used was GSE120200.
Wild type human neurons were cultured that had the ASXL1 gene knocked out. Using various experiments, including injecting the neurons into chicken embryos, the knockout was shown to interfere with neural crest formation.
Figure 1: How cell lines for RNA-seq data were sourced and confirmed to be a proper model
Scheme illustrating the generation of human pluripotent stem cell lines carrying premature stop codon (PSC) mutations in ASXL1.
Scheme of human ASXL1 protein showing annotated domains (NR, nuclear receptor), and locations of mutations frequently reported in BOS patients (red tinted sector) and present in BOS-iPSC and ASXL1PSC/PSC/ASXL1PSC/+ hESC clones.
Expression of ASXL1 in BOS-iPSC lines and ASXL1PSC/PSC/ASXL1PSC/+ hESC clones relative to the respective iPSC/hESC control lines using primers targeting exon 4 (mean ± SEM, n ≥ 3 different clones/passages).
Sequences of reverse transcribed ASXL1 transcripts from ASXL1 mutant lines.
(and F) Representative blotting (n = 3–5 independent experiments) of ASXL1, using a monoclonal antibody raised against the N terminus, in an hESC line overexpressing a truncated ASXL1 variant (PB-ASXL1PSC) (E) and in human iPSC and hESC lines (F).
This figure (Matheus et al., 2019) shows how the cell lines were generated. Essentially it compares the expression and transcripts of the cultured cell model with those of patients with Bohring-Opitz Syndrome and the control, specifically with regards to ASXL1. We can see that the cultured cell model looks similar in results to that of the BOS patients and dissimilar from the control.
plotMDS(d, labels=rownames(samples),
col = c("darkgreen","blue")[factor(samples)])
Figure 2: MDS plot
Legend:
Green: Homozygous knockout of ASXL1 Blue: Wild type (non-truncated ASXL1)
This plot from A1 shows that WT clusters to one side while HOM clusters to another, and they should, because they are different samples of the same cell cultures.
As such, the model can be trivially defined by whether it is wild type or HOM. But does it follow poisson distribution or negative binomial distribution?
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE, NBline=TRUE,
show.ave.raw.vars = FALSE,show.binned.common.disp.vars = TRUE)
Figure 3: Mean Variance Plot
Legend: Blue line is negative binomial model, black line is poisson model.
Based on the above plot (from A1 as well) we should use a negative binomial model, eg. edgeR, since the distribution follows the negative binomial line.
Since we are using edgeR, we can use the DGEList and model from A1, and we might have to since DGEList takes in raw counts, not normalized counts. Drawback of this is that we will have to remap to HUGO symbols when making the thresholded list for overrepresentation analysis, and we might come across values that we previously filtered out because there was no symbol.
We only have two groups, so we should use exactTest, according to lecture 5. I will also try glmQLFTest as well. We set n to the number of total genes and a standard p value limit of 0.05. We use Benjamini-Hochberg correction because it is the default, and because I’m also going to use it for g:Profiler for the thresholded list analysis.
e_t <- exactTest(d, dispersion="auto") #the most complex dispersion in d
uncorrected_table_e<- e_t[["table"]]
uncorrected_table_top_e <- uncorrected_table_e[order(uncorrected_table_e$PValue, decreasing = FALSE),]
uncorrected_table_top_e <- uncorrected_table_top_e[uncorrected_table_top_e$PValue < 0.05,]
head(uncorrected_table_top_e)
tags_e <- topTags(e_t, n=15845, adjust.method="BH", p.value = 1)
#sort(e_t[[table]][[PValue]], decreasing=TRUE)
corrected_table_e = tags_e[["table"]]
head(corrected_table_e)
fit <- glmQLFit(d, model_design)
results <- glmQLFTest(fit, coef=2) #test samplesWT in model design so that null hypothesis is that wild type == HOM mutation
uncorrected_table_g <- results[["table"]]
uncorrected_table_top_g <- uncorrected_table_g[order(uncorrected_table_e$PValue, decreasing = FALSE),]
uncorrected_table_top_g <- uncorrected_table_top_g[uncorrected_table_top_g$PValue < 0.05,]
head(uncorrected_table_top_g)
tags_g <- topTags(results, n=15845, adjust.method="BH", p.value = 1)
corrected_table_g = tags_g[["table"]]
head(corrected_table_g)
As we can see, the top hits for both approaches are the same but the glmQLFit approach gives us 7220 results compared to the exactTest approach which gives 6803 results.
How many genes pass correction?
glmQLFit: 5248
exactTest: 5291
Using an approach suggested by Blighe (2018) I now create a volcano plot.
insignificant_g = corrected_table_g[which((corrected_table_g$logFC <= 1 & corrected_table_g$logFC >= -1) | corrected_table_g$FDR >= 0.05),]
upregulated_g = corrected_table_g[which(corrected_table_g$logFC > 1 & corrected_table_g$FDR < 0.05),]
downregulated_g = corrected_table_g[which(corrected_table_g$logFC < -1 & corrected_table_g$FDR < 0.05),]
plot(corrected_table_g$logFC, -log10(corrected_table_g$PValue), pch=20, main = "GLMQLFit Volcano", xlab = "log2(FC)", ylab= "-log10(Adjusted PValue)", col=rgb(0,0,0,0))
points(insignificant_g$logFC, -log10(insignificant_g$PValue), pch=20, col="black")
points(upregulated_g$logFC, -log10(upregulated_g$PValue), pch=20, col="green")
points(downregulated_g$logFC, -log10(downregulated_g$PValue), pch=20, col="red")
insignificant_e = corrected_table_e[which((corrected_table_e$logFC <= 1 & corrected_table_e$logFC >= -1) | corrected_table_g$FDR >= 0.05),]
upregulated_e = corrected_table_e[which(corrected_table_e$logFC > 1 & corrected_table_e$FDR < 0.05),]
downregulated_e = corrected_table_e[which(corrected_table_e$logFC < -1 & corrected_table_e$FDR < 0.05),]
plot(corrected_table_e$logFC, -log10(corrected_table_e$PValue), pch=20, main = "ExactTest Volcano", xlab = "log2(FC)", ylab= "-log10(Adjusted PValue)", col=rgb(0,0,0,0))
points(insignificant_e$logFC, -log10(insignificant_e$PValue), pch=20, col="black")
points(upregulated_e$logFC, -log10(upregulated_e$PValue), pch=20, col="green")
points(downregulated_e$logFC, -log10(downregulated_e$PValue), pch=20, col="red")
Figure 4, 5: Volcano plots for GLMQLFit and ExactTest approaches
Legend: Red is significantly downregulated in our model (2 fold change down or more), green is significantly upregulated (2 fold change up or more)
Note that for our model, the design is whether the sample is WT or not. therefore, red is actually upregulated in HOM while green is downregulated in HOM compared to WT.
Now we need to create a thresholded list for the second part of analysis. I will use glmQLFit results because the list is smaller.
corrected_g_gn <- merge(hgnc_table[,1:2], corrected_table_g, by.x = 1, by.y = 0)
head(corrected_g_gn)
corrected_g_gn[,"rank"] <- -log10(corrected_g_gn$PValue) * sign(corrected_g_gn$logFC) *-1
#use -1 to reverse the order so that the upregulated genes are actually upregulated...
corrected_g_gn <- corrected_g_gn[order(corrected_g_gn$rank),]
downregulated_gn <- corrected_g_gn$hgnc_symbol[which(corrected_g_gn$FDR < 0.05 & corrected_g_gn$logFC >0)] #our logFCs are backwards so swap the directions of the logFC comparison.
upregulated_gn <- corrected_g_gn$hgnc_symbol[which(corrected_g_gn$FDR < 0.05 & corrected_g_gn$logFC <0)]
head(upregulated_gn)
[1] "PPCS" "PTPA" "DHRS2" "SYT7" "RFWD3" "GFRA3"
head(downregulated_gn)
[1] "ZIC1" "ZIC4" "FOXP2" "SCN3A" "CSRNP3" "MIR924HG"
corrected_g_gn[corrected_g_gn$hgnc_symbol == downregulated_gn[1],]
corrected_g_gn[which(corrected_g_gn$hgnc_symbol == upregulated_gn[1] | corrected_g_gn$hgnc_symbol == upregulated_gn[length(upregulated_gn)]),]
write.table(x=upregulated_gn,
file="ASXL1_knockout_upreg.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_gn,
file="ASXL1_knockout_downreg.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
Major correctness test:
In the paper there is a supplementary table indicating what they got for top downregulated genes. The top hit for that in terms of P Value is ZIC1. Note we just called order normally, so it is sorted in increasing order. Hence, downregulated genes first, then upregulated genes. To confirm our approach, we should see ZIC1 near the top of the downregulated table and we should see that the bottom of the upregulated table should have the highest P values.
Based on the tables and console output above, we do indeed confirm the presence of ZIC1 and that the upregulated table is sorted by lower PValues first. This supports the results the original paper because it indicates that our analysis passed a small sanity test and that we got one of the same results in our analysis.
Heatmap
Using code from Lecture 5:
library(ComplexHeatmap)
library(circlize)
top_hits <- rownames(tags_g$table)[tags_g$table$FDR<0.05]
heatmap_matrix_tophits <- t(
scale(t(normalized_filtered_counts[which(rownames(normalized_filtered_counts) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
In this heatmap, the conditions clearly cluster together since in the dendrogram, WT are in a clade and HOM are in another clade. We kind of see a checkerboard pattern between the WT and HOM up and downregulated genes.
Which method did you choose and why?
g:Profiler(Raudvere et al., 2019), because we went through the workflow in class already. This was a very arbitrary choice.
What annotation data did you use and why? What version of the annotation are you using?
GO:BP, Reactome, WikiPathways, KEGG for fun
GO:BP – releases/2019-07-01
REAC – annotations: ensembl
classes: 2019-10-2
WP – 20190910
KEGG – KEGG FTP Release 2019-09-30
How many genesets were returned with what thresholds?
Upregulated list
Threshold: 15-10000
GO:BP – 224
KEGG – 6
REAC –37
WP - 0
Threshold: 15-200
GO:BP – 75
KEGG – 5
REAC –33
WP - 0
Downregulated list
Threshold: 15-10000
GO:BP – 689
KEGG – 61
REAC –70
WP - 34
Threshold: 15-200
GO:BP – 319
KEGG – 54
REAC –56
WP - 33
Combined list
Threshold: 15-10000
GO:BP – 554
KEGG – 44
REAC –13
WP - 7
Threshold: 15-200
GO:BP – 190
KEGG – 36
REAC –9
WP - 7
Figures 6-11: Screenshots from g:Profiler
Figure 6: Upregulated small list
Figure 7: Upregulated big list
Figure 8: Downregulated small list
Figure 9: Downregulated big list
Figure 10: Combined small list
Figure 11: Combined big list
Legend
How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
It seems that there is not much correlation between the two upregulated and downregulated lists. However, by combining both up and regulated lists, we seem to get more accurate results. For instance, in the wikipathways result for the combined list, we get “Neural Crest Differentiation” which was the focus of the study in the first place. This over-representation result actually supports the conclusion of the original paper, that the knockout of ASXL1 affects neural crest differentiation. However, the P-value isn’t the highest overall for the annotation dataset.
Blighe, K. (2018). Question: How to change colour of points in volcano plot by common genes? https://www.biostars.org/p/190779/
Davis, S., & Meltzer, P. (2007). GEOquery: A bridge between the gene expression omnibus (geo) and bioconductor. Bioinformatics, 14, 1846–1847.
Durinck, S., Moreau, Y., Kasprzyk, A., Davis, S., De Moor, B., Brazma, A., & Huber, W. (2005). BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics, 21, 3439–3440.
Durinck, S., Spellman, P. T., Birney, E., & Huber, W. (2009). Mapping identifiers for the integration of genomic datasets with the r/bioconductor package biomaRt. Nature Protocols, 4, 1184–1191.
Gu, Z., Eils, R., & Schlesner, M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
Gu, Z., Gu, L., Eils, R., Schlesner, M., & Brors, B. (2014). Circlize implements and enhances circular visualization in r. Bioinformatics, 30(19), 2811–2812.
Matheus, F., Rusha, E., Rehimi, R., Molitor, L., Pertek, A., Modic, M., Feederle, R., Flatley, A., Kremmer, E., Geerlof, A., & others. (2019). Pathological asxl1 mutations and protein variants impair neural crest development. Stem Cell Reports, 12(5), 861–868.
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor rna-seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297. https://doi.org/10.1093/nar/gks042
petrelharp. (2016). Knitr: Run all chunks in an rmarkdown document. https://stackoverflow.com/questions/24753969/knitr-run-all-chunks-in-an-rmarkdown-document
Raudvere, U., Kolberg, L., Kuzmin, I., Arak, T., Adler, P., Peterson, H., & Vilo, J. (2019). G: Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research, 47(W1), W191–W198.
R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). EdgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616